Current Issue: July to September, Volume 2012, Issue 3 (5 articles)
This study proposes a music-aided framework for affective interaction between service robots and humans. The framework consists of three systems for perception, memory, and expression, modeled on human brain mechanisms. We propose a novel approach to identifying human emotions in the perception system. Conventional approaches use speech and facial expressions as the representative bimodal indicators for emotion recognition; our approach additionally uses the mood of music as a supplementary indicator to determine emotions more reliably. For multimodal emotion recognition, we propose an effective decision criterion that uses records of bimodal recognition results associated with the musical mood. The memory and expression systems also utilize musical data to provide natural and affective reactions to human emotions. To evaluate our approach, we simulated the proposed human-robot interaction with a service robot, iRobiQ. Our perception system outperformed the conventional approach, and most human participants responded favorably to the music-aided affective interaction.
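The decision criterion itself is not spelled out in the abstract; the Python sketch below only illustrates, under assumed labels and weights, how bimodal (speech + face) emotion scores might be combined with a music-mood prior. The function fuse_emotion_scores, the emotion labels, and the MOOD_PRIOR table are hypothetical, not the authors' actual criterion.

```python
# Hypothetical sketch: combine bimodal (speech + face) emotion scores with a
# music-mood prior. Labels, weights, and priors are illustrative assumptions.

EMOTIONS = ["happy", "sad", "angry", "neutral"]

# Assumed prior probability of each emotion given the detected musical mood.
MOOD_PRIOR = {
    "cheerful": {"happy": 0.5, "sad": 0.1, "angry": 0.1, "neutral": 0.3},
    "gloomy":   {"happy": 0.1, "sad": 0.5, "angry": 0.2, "neutral": 0.2},
}

def fuse_emotion_scores(speech_scores, face_scores, mood, alpha=0.7):
    """Fuse bimodal scores with a music-mood prior and return the top emotion.

    speech_scores, face_scores: dicts mapping emotion -> probability.
    mood: key into MOOD_PRIOR (e.g. from a music-mood classifier).
    alpha: weight given to the bimodal evidence versus the mood prior.
    """
    prior = MOOD_PRIOR[mood]
    fused = {}
    for e in EMOTIONS:
        bimodal = 0.5 * speech_scores.get(e, 0.0) + 0.5 * face_scores.get(e, 0.0)
        fused[e] = alpha * bimodal + (1.0 - alpha) * prior[e]
    return max(fused, key=fused.get)

# Example: ambiguous bimodal evidence is tipped toward "happy" by cheerful music.
print(fuse_emotion_scores(
    {"happy": 0.4, "sad": 0.4, "angry": 0.1, "neutral": 0.1},
    {"happy": 0.35, "sad": 0.35, "angry": 0.1, "neutral": 0.2},
    mood="cheerful",
))
```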
We address the question of whether and how boosting and bagging can be used for speech recognition. In order to do this, we compare two different boosting schemes, one at the phoneme level and one at the utterance level, with a phoneme-level bagging scheme. We control for many parameters and other choices, such as the state inference scheme used. In an unbiased experiment, we clearly show that the gain of boosting methods compared to a single hidden Markov model is in all cases only marginal, while bagging significantly outperforms all other methods. We thus conclude that bagging methods, which have so far been overlooked in favour of boosting, should be examined more closely as a potentially useful ensemble learning technique for speech recognition.
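As a hedged illustration of the bagging scheme the abstract favours, the sketch below trains an ensemble on bootstrap resamples of the training data and predicts by majority vote. The stand-in train_dummy_model replaces real HMM acoustic-model training and is purely illustrative.

```python
# Minimal bagging sketch: each "model" is trained on a bootstrap resample and
# the ensemble prediction is a majority vote. Real acoustic models (HMMs) are
# replaced by an abstract train/predict interface for illustration only.
import random
from collections import Counter

def train_dummy_model(samples):
    """Stand-in for HMM training: remembers the majority label per feature."""
    table = {}
    for features, label in samples:
        table.setdefault(features, []).append(label)
    return {f: Counter(labels).most_common(1)[0][0] for f, labels in table.items()}

def bagging_train(train_set, n_models=10, seed=0):
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        # Bootstrap resample: draw |train_set| examples with replacement.
        resample = [rng.choice(train_set) for _ in train_set]
        models.append(train_dummy_model(resample))
    return models

def bagging_predict(models, features, default="sil"):
    votes = [m.get(features, default) for m in models]
    return Counter(votes).most_common(1)[0][0]

# Toy usage: features are symbolic here; in practice they would be acoustic frames.
data = [("f1", "aa"), ("f1", "aa"), ("f2", "iy"), ("f2", "iy"), ("f2", "aa")]
ensemble = bagging_train(data)
print(bagging_predict(ensemble, "f2"))
```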
The main objective of the work presented in this paper was to develop a complete system that would accomplish the original visions of the MALACH project. Those goals were to employ automatic speech recognition and information retrieval techniques to provide improved access to the large video archive containing recorded testimonies of Holocaust survivors. So far, the system has been developed only for the Czech part of the archive. It takes advantage of a state-of-the-art speech recognition system tailored to the challenging properties of the recordings in the archive (elderly speakers, spontaneous speech, and emotionally loaded content) and its close coupling with the actual search engine. The design of the algorithm, which adopts the spoken term detection approach, is focused on the speed of retrieval. The resulting system is able to search through the 1,000 h of video constituting the Czech portion of the archive and find query word occurrences in a matter of seconds. The phonetic search implemented alongside the search based on lexicon words makes it possible to find even words outside the ASR system lexicon, such as names, geographic locations, or Jewish slang.
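The following is a minimal, hypothetical sketch of phonetic spoken term detection in the spirit of the described search: a query's phone sequence is matched against a pre-built (time, phone) index so that out-of-vocabulary words can still be located. The index format and the function phonetic_search are assumptions for illustration, not the actual MALACH system interface.

```python
# Illustrative sketch of phonetic spoken term detection over pre-indexed ASR
# output. Simplifying assumptions: one recording is reduced to a list of
# (start_time, phone) pairs, and a query is matched as an exact phone sequence.

def phonetic_search(phone_index, query_phones):
    """Return start times where the query phone sequence occurs."""
    hits = []
    n = len(query_phones)
    phones = [p for _, p in phone_index]
    for i in range(len(phones) - n + 1):
        if phones[i:i + n] == query_phones:
            hits.append(phone_index[i][0])  # start time of the first phone
    return hits

# Toy index for one recording: (time in seconds, phone). Data is made up.
index = [(12.0, "t"), (12.1, "e"), (12.2, "r"), (12.3, "e"), (12.4, "z"),
         (12.5, "i"), (12.6, "n"), (30.0, "a"), (30.1, "h")]

# An out-of-vocabulary name can still be found via its phone sequence.
print(phonetic_search(index, ["t", "e", "r", "e", "z", "i", "n"]))
```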
It has long been speculated that expressions of emotion in different modalities have the same underlying 'code', whether it be a dance step, musical phrase, or tone of voice. This is the first attempt to implement this theory across three modalities, inspired by the polyvalence and repeatability of robotics. We propose a unifying framework to generate emotions across voice, gesture, and music by representing emotional states as a 4-parameter tuple of speed, intensity, regularity, and extent (SIRE). Our results show that a simple 4-tuple can capture four emotions recognizable at greater than chance across gesture and voice, and at least two emotions across all three modalities. An application for multi-modal, expressive music robots is discussed.
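A minimal sketch of the SIRE idea follows, assuming each parameter lies in [0, 1] and that simple linear mappings render the tuple into voice, music, and gesture parameters. The concrete ranges and mapping functions are illustrative assumptions, not the authors' calibration.

```python
# Sketch: one SIRE tuple (speed, intensity, regularity, extent), each in [0, 1],
# rendered into per-modality parameters. All mappings below are hypothetical.
from dataclasses import dataclass

@dataclass
class SIRE:
    speed: float       # overall tempo of the behaviour
    intensity: float   # energy / loudness
    regularity: float  # how periodic vs. jittery
    extent: float      # spatial or dynamic range

def to_voice(s: SIRE):
    return {"speech_rate_wpm": 90 + 120 * s.speed,
            "volume_db": -20 + 15 * s.intensity,
            "pitch_range_semitones": 2 + 10 * s.extent}

def to_music(s: SIRE):
    return {"tempo_bpm": 60 + 100 * s.speed,
            "velocity": int(40 + 80 * s.intensity),
            "swing": 1.0 - s.regularity}

def to_gesture(s: SIRE):
    return {"joint_speed": s.speed,
            "amplitude": s.extent,
            "jerkiness": 1.0 - s.regularity}

# Example: a "happy"-like state: fast, fairly intense, regular, large extent.
happy = SIRE(speed=0.8, intensity=0.7, regularity=0.8, extent=0.9)
print(to_voice(happy), to_music(happy), to_gesture(happy))
```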
Most voice activity detection (VAD) schemes operate in the discrete Fourier transform (DFT) domain by classifying each sound frame as speech or noise based on the DFT coefficients. These coefficients are used as features in VAD, and thus their robustness has an important effect on the performance of the VAD scheme. However, some shortcomings of modeling a signal in the DFT domain can easily degrade the performance of a VAD in a noisy environment. Instead of using the DFT coefficients in VAD, this article presents a novel approach that uses the complex coefficients derived from the complex exponential atomic decomposition of a signal. Using a goodness-of-fit test, we show that these coefficients are well modeled by a Gaussian probability distribution. A statistical model is employed to derive the decision rule from the likelihood ratio test. According to the experimental results, the proposed VAD method shows better performance than the VAD based on DFT coefficients in various noise environments.
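A minimal sketch of a likelihood-ratio-test VAD under simplified assumptions: each frame's coefficients are treated as zero-mean Gaussian with different variances under the noise-only and speech-plus-noise hypotheses, and the average per-coefficient log-likelihood ratio is compared to a threshold. The variance estimates and the threshold below are illustrative, not the article's values.

```python
# Likelihood-ratio-test VAD sketch with zero-mean Gaussian models per hypothesis.
import math

def frame_is_speech(coeffs, noise_var, speech_var, threshold=1.0):
    """Decide speech/noise for one frame from its (real-valued) coefficients.

    Uses the average per-coefficient log-likelihood ratio
    log p(c | speech) - log p(c | noise) for zero-mean Gaussian models.
    """
    def log_gauss(c, var):
        return -0.5 * math.log(2 * math.pi * var) - (c * c) / (2 * var)

    llr = sum(log_gauss(c, speech_var) - log_gauss(c, noise_var) for c in coeffs)
    return (llr / len(coeffs)) > math.log(threshold)

# Toy usage: a low-energy frame vs. a higher-energy frame.
noise_frame = [0.1, -0.2, 0.05, 0.15]
speech_frame = [1.2, -0.9, 1.5, -1.1]
print(frame_is_speech(noise_frame, noise_var=0.05, speech_var=1.0))   # False
print(frame_is_speech(speech_frame, noise_var=0.05, speech_var=1.0))  # True
```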